Improve `__crystal_once` performance #15216

BlobCodes · 2024-11-23T04:32:51Z

Description

This PR significantly improves __crystal_once performance.

I originally wrote this code for a custom crystal stdlib focused on microchips without malloc.

While the previous implementation used an array to keep track of all variables currently being initialized, this one (ab)uses the given flag boolean pointer as an enum with three values to represent the "being initialized" state.

Also, the previous implementation always used a mutex when accessing any const variable using -Dpreview_mt.
This implementation instead has a fast path for the (very likely) scenario that the variable was already initialized which doesn't need a mutex.

Let's talk numbers

Benchmark code and results on my machine can be found here:
https://gitlab.com/-/snippets/4772439

HertzDevil · 2024-11-23T04:52:44Z

Codegen still declares an i1 for the flag in Crystal::CodeGenVisitor#declare_const_initialized_flag, there will probably be undefined behavior unless this is also changed

BlobCodes · 2024-11-23T04:58:08Z

Codegen still declares an i1 for the flag in Crystal::CodeGenVisitor#declare_const_initialized_flag, there will probably be undefined behavior unless this is also changed

I'm also working on the codegen part (primarily to improve performance), but I think this shouldn't have any undefined behavior since the pointer is always cast into the enum type before accessing it, and a boolean has to be at least one byte in size.

HertzDevil · 2024-11-23T05:27:38Z

A cast doesn't cut it because LLVM is free to assume the unused bits in the flags are poison.

straight-shoota · 2024-11-23T11:35:48Z

The idea sounds great, but the implementation must be sound.
I suppose we could change to a bigger integer type for real?

Also related to #14905 and #15085 which might want to use a similar mechanism.

src/crystal/once.cr

ysbaddaden · 2024-11-23T13:23:28Z

@BlobCodes Very nice, thank you! 😍

@HertzDevil By poison you mean that the cast may feel safe, but that we'd be assuming some undefined behavior? LLVM doesn't make guarantees that the 7 other bits will be carried over, just that the value for an i1 can be 0 or 1?

@straight-shoota That sounds safer, that'd need a new __crystal_once2 fun (for backward compiler compatibility) and to keep the slow legacy initialization.

Co-authored-by: Sijawusz Pur Rahnama <[email protected]>

straight-shoota · 2024-11-23T16:46:12Z

need a new __crystal_once2 fun (for backward compiler compatibility) and to keep the slow legacy initialization.

We should keep the existing implementation around for compatibilty with older compilers.
But it might not be necessary to introduce a new fun. The symbol is not even used anywhere because it should always get inlined. And the compiler knows whether the second parameter is Bool or UInt8, so it can generate code approriately.

I'm wondering why this is even a fun when it's not intended to be exported. A def should be fine for this then.

BlobCodes · 2024-11-23T19:01:15Z

A cast doesn't cut it because LLVM is free to assume the unused bits in the flags are poison.

The main performance improvement is from the inlined fast-check, which is also possible without using the enum.

Removing the Array usage was primarily done to use __crystal_once together with --prelude=empty.
I could also create another PR with only the performance improvement.

That sounds safer, that'd need a new __crystal_once2 fun (for backward compiler compatibility) and to keep the slow legacy initialization.

Creating a new __crystal_once2 could reduce the overhead of the once mechanism even further, so that's a good idea anyways.

straight-shoota · 2024-11-23T19:09:47Z

Sounds like a great idea to separate the two changes into separate PRs. That's in general, and also in particular if we can merge one without any prerequisites.

@BlobCodes

Based on the PR by @BlobCodes: crystal-lang#15216 The performance improvement is the usage of a i8 instead of an i1 boolean to have 3 states instead of 2, which permits to quickly detect recursive calls without an array + inline tricks to optimize the fast and slow paths. Unlike the PR: 1. Removes the need for a state maintained by the compiler. This keeps the ability for an older compiler to compile a new release of the compiler (or use a newer stdlib) but breaks the ability for a new compiler to compile an older release (or use an older stdlib)! 2. Doesn't use atomics: we still use a mutex that already guarantees the acquire/release memory ordering semantics, and __crystal_once_init is only ever called once in the main thread before any other thread can be started.

@BlobCodes

Based on the PR by @BlobCodes: crystal-lang#15216 The performance improvement is two-fold: 1. the usage of a i8 instead of an i1 boolean to have 3 states instead of 2, which permits to quickly detect recursive calls without an array; 2. inline tricks to optimize the fast and slow paths. Unlike the PR: 1. Doesn't use atomics: it already uses a mutex that guarantees acquire release memory ordering semantics, and __crystal_once_init is only ever called in the main thread before any other thread is started. 2. Removes the need for a state maintained by the compiler, yet keeps forward and backward compatibility (both signatures are supported).

@BlobCodes

Based on the PR by @BlobCodes: crystal-lang#15216 The performance improvement is two-fold: 1. the usage of a i8 instead of an i1 boolean to have 3 states instead of 2, which permits to quickly detect recursive calls without an array; 2. inline tricks to optimize the fast and slow paths. Unlike the PR: 1. Doesn't use atomics: it already uses a mutex that guarantees acquire release memory ordering semantics, and __crystal_once_init is only ever called in the main thread before any other thread is started. 2. Removes the need for a state maintained by the compiler, yet keeps forward and backward compatibility (both signatures are supported).

@BlobCodes

Co-authored-by: David Keller <[email protected]> Based on the PR by @BlobCodes: crystal-lang#15216 The performance improvement is two-fold: 1. the usage of a i8 instead of an i1 boolean to have 3 states instead of 2, which permits to quickly detect recursive calls without an array; 2. inline tricks to optimize the fast and slow paths. Unlike the PR: 1. Doesn't use atomics: it already uses a mutex that guarantees acquire release memory ordering semantics, and __crystal_once_init is only ever called in the main thread before any other thread is started. 2. Removes the need for a state maintained by the compiler, yet keeps forward and backward compatibility (both signatures are supported).

@BlobCodes

Co-authored-by: David Keller <[email protected]> Based on the PR by @BlobCodes: crystal-lang#15216 The performance improvement is two-fold: 1. the usage of a i8 instead of an i1 boolean to have 3 states instead of 2, which permits to quickly detect recursive calls without an array; 2. inline tricks to optimize the fast and slow paths. Unlike the PR: 1. Doesn't use atomics: it already uses a mutex that guarantees acquire release memory ordering semantics, and __crystal_once_init is only ever called in the main thread before any other thread is started. 2. Removes the need for a state maintained by the compiler, yet keeps forward and backward compatibility (both signatures are supported).

BlobCodes added 2 commits November 23, 2024 04:43

Perf: More efficient __crystal_once

a8fdff3

Fix mutex deadlock on race-condition

31916e2

BlobCodes changed the title ~~Improve __crystal_once performance~~ Improve __crystal_once performance Nov 23, 2024

Sija reviewed Nov 23, 2024

View reviewed changes

src/crystal/once.cr Outdated Show resolved Hide resolved

Update src/crystal/once.cr

86cb037

Co-authored-by: Sijawusz Pur Rahnama <[email protected]>

ysbaddaden mentioned this pull request Jan 9, 2025

Rework initialization of constants & class variables #15333

Merged

straight-shoota closed this in #15333 Jan 20, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve `__crystal_once` performance #15216

Improve `__crystal_once` performance #15216

BlobCodes commented Nov 23, 2024

HertzDevil commented Nov 23, 2024

BlobCodes commented Nov 23, 2024

HertzDevil commented Nov 23, 2024

straight-shoota commented Nov 23, 2024

ysbaddaden commented Nov 23, 2024

straight-shoota commented Nov 23, 2024

BlobCodes commented Nov 23, 2024

straight-shoota commented Nov 23, 2024

Improve __crystal_once performance #15216

Improve __crystal_once performance #15216

Conversation

BlobCodes commented Nov 23, 2024

Description

Let's talk numbers

HertzDevil commented Nov 23, 2024

BlobCodes commented Nov 23, 2024

HertzDevil commented Nov 23, 2024

straight-shoota commented Nov 23, 2024

ysbaddaden commented Nov 23, 2024

straight-shoota commented Nov 23, 2024

BlobCodes commented Nov 23, 2024

straight-shoota commented Nov 23, 2024

Improve `__crystal_once` performance #15216

Improve `__crystal_once` performance #15216